Chromosome-level genome assembly and annotation of Zicaitai (Brassica rapa var. purpuraria)

Zicaitai is a seasonal vegetable known for its high anthocyanin content in both stalks and leaves, yet its reference genome has not been published to date. Here, we generated the first chromosome-level genome assembly of Zicaitai using a combination of PacBio long-reads, Illumina short-reads, and Hi-C sequencing techniques. The final genome length is 474.12 Mb with a scaffold N50 length of 43.82 Mb, a BUSCO score of 99.30% and the LAI score of 10.14. Repetitive elements accounted for 60.89% (288.72 Mb) of the genome, and Hi-C data enabled the allocation of 430.87 Mb of genome sequences to ten pseudochromosomes. A total of 42,051 protein-coding genes were successfully predicted using multiple methods, of which 99.74% were functionally annotated. Notably, comparing the genome of Zicaitai with seven other species in the Cruciferae family revealed strong conservation in terms of gene numbers and structures. Overall, the high-quality genome assembly provides a critical resource for studying the genetic basis of important agronomic traits in Zicaitai.


Background and Summary
Brassica rapa var.purpuraria (NCBI: txid386281, Fig. 1) belongs to the Cruciferae family 1,2 and is named "Zicaitai" for its purple stalks 3 .Zicaitai originated in the southern regions of China and then spread to the Yangtze River Basin, where it was subsequently widely domesticated.Its cultivation history can be traced back to ancient China, spanning more than a thousand years.To date, Zicaitai is a popular vegetable in China, and is also exported to other countries in Asia, Europe, and America.
Nowadays, some candidate genes related to anthocyanin biosynthesis have been identified 7,13,[15][16][17][18][19] .For example, Hayashi et al. 9 have crossed a doubled haploid line of the turnip Brassica rapa cv.'Iyo-hikabu' , which is pigmented with anthocyanin, with a Chinese cabbage inbred line, 'Y54' , which lacks anthocyanin pigmentation, and identified a novel locus (Anp) on chromosome A07 9 related with anthocyanin synthesis based on a bulked segregant analysis.Burdzinski and Wendell (2007) identified three markers linked to anthocyaninless, forming a linkage group spanning 46.9 cM, which were assigned to Brassica rapa linkage group R09 20 based on 177 F2 offspring.Additionally, the pur gene, responsible for regulating purple leaf color, was successfully mapped to the end of chromosome A03 using an F2 population 21 .An insertion and deletion (InDel) marker developed from deletion/insertion in the promoter region of bHLH49 in the F2 population was found to significantly correlate with the stalk color trait 2 .EGL3, a positive regulator gene with potentially epistatic function, was localized to mediate the anthocyanin biosynthesis 1 .Two candidate genes controlling anthocyanin accumulation were identified in the F2 population derived from a cross between Zicaitai and caixin 5 , and they were homologous with AtEGL3, BrEGL3.1 and BrEGL3.2genes.Although several studies have characterized anthocyanin in Brassica crops, there is limited information on the genes involved in anthocyanin biosynthesis in Zicaitai.
Understanding the genome structure and identifying candidate genes related to anthocyanin biosynthesis is crucial for Zicaitai.However, the lack of a high-quality reference genome for Zicaitai makes it challenging to identify candidate genes associated with important agronomic traits and breed excellent Zicaitai varieties.Hence, we have generated a chromosome-level genome assembly of Zicaitai using PacBio long-reads, Illumina short reads, and Hi-C sequencing data first in this study.The assembled genome has a total length of 474.12 Mb, with a scaffold N50 length of 43.82 Mb, and 90.88% of the genome sequence was successfully anchored onto ten pseudochromosomes.Through a combination of ab initio gene predictions, RNA-seq, and homologous protein evidence, a total of 42,051 protein-coding genes were identified, and 41,942 of them were functionally annotated.The genome sequence provides a valuable resource for exploring the molecular basis of agronomic traits in Zicaitai and will further facilitate its genetic improvements.

Sample collection.
Young fresh leaves of Zicaitai were collected from one sample individual grown in a greenhouse in Guangzhou, Guangdong, China (N 23°06' , E 113°15'), and immediately frozen in liquid nitrogen for genomic DNA and RNA extraction.

DNA extraction and sequencing. Total high molecular weight (HMW) genomic DNA was extracted from
Zicaitai young fresh leaves using the Tiangen Extraction Kit (Tiangen Biotech (Beijing) Co., Ltd.) for whole genome sequencing.The extraction process was according to the cetyltrimethylammonium bromide (CTAB) method, and concentration was ascertained by the Quant-iT PicoGreen ® assay (Invitrogen, Waltham, MA, USA).
The quality and quantity of the DNA samples were assessed using an ultraviolet spectrophotometer at 260 nm and 280 nm wave lengths.The DNA was fragmented with a Covaris M220 Focused-ultrasonicator Instrument.For genomic DNA sequencing, we employed three different approaches at Novogene Co., Ltd., Beijing, China.Firstly, DNA PCR-free libraries with insert sizes of 350 bp were constructed using the NEBNext Ultra DNA library Pre-Kit for Illumina short-reads sequencing.The resulting barcoded library were sequenced on the Illumina Hiseq 4000 platform to generate paired-end 150-bp reads.Subsequently, all the obtained reads were quality controlled by trimming adaptor sequences and low-quality reads using NGSQC v2.3 22 (-q 20, https://github.com/mjain-lab/NGSQCToolkit).Secondly, single-molecule real-time (SMRT) PacBio libraries were constructed using the PacBio 15-kb protocol and sequenced with a Pacbio Sequel IIe platform.Lastly, the Hi-C library was generated using the restriction endonuclease DpnII.The DpnII-digested chromatin was labeled with biotin-14-dATP, and in situ DNA ligation was performed.The DNA underwent extraction, purification, and shearing.After A-tailing, pull-down, and adapter ligation, the DNA library was sequenced on the Illumina Hiseq 4000 platform.The total data generated from long-read sequencing was 35.29 Gb, and the total data generated from short-read sequencing was 10.06 Gb (Table 1).
RNA extraction and sequencing.Simultaneously, fresh root, stem, leaf, flower, and pod of the same Zicaitai individual were collected for transcriptome sequencing.Total RNA was extracted using the TRIzol reagent (Thermo Fisher Scientific, MA, USA) according to the manufacturer's protocol.RNA-seq library were sequenced on an Illumina Novaseq 6000 platform with paired-end 150 bp reads.The adapters and low-quality reads of the raw sequence reads were trimmed using NGSQC v2.3 22 (-q 20, https://github.com/mjain-lab/NGSQCToolkit).A total of ~15 Gb raw reads were yielded and used for the gene prediction.
For the genome assembly of Zicaitai, Hifiasm v0.16.1 26 was first used to assemble the initial assembly with PacBio CCS sequences.Secondly, NextPolish v1.3.1 27 was then applied three times to polish the draft genome assembly using Illumina short reads.Then, ALLHiC 28 , a Hi-C scaffolding pipeline, was used to align Hi-C reads to the draft assembly and subject them to quality control.Finally, 3D-DNA v180419 29 was used to anchor primary contigs into chromosomes, and ambiguous fragments were removed manually with Juicebox v1.13 30 , a visualization software for Hi-C data.All of the above software runs with default parameters.The final genome assembly of Zicaitai was 474.12 Mb with a scaffold N50 of 43.82 Mb.The Hi-C analyses scaffolded ten pseudo-chromosomes (Fig. 2a), anchoring 90.88% (430.88Mb) of the genome assembly.The average GC content of Zicaitai genome assembly was 38.54% (Table 2, Fig. 2b).

Genome annotation.
The repetitive elements, protein-coding genes, and non-coding RNAs (ncRNAs) of the Zicaitai genome was annotated.The whole genome repeats were identified using a combined strategy based on homology alignment and de novo search.Tandem Repeat was extracted using TRF 33 by ab initio prediction.
The homolog prediction commonly used Repbase 34 database employing RepeatMasker and RepeatProteinMask with default parameters to extract repeat regions.Ab initio prediction built the de novo repetitive elements database by LTR_FINDER 35 , RepeatScout 36 , and RepeatModeler2 37 (-LTRStruct), and then all repeat sequences with lengths more than 100 bp and gap 'N' less than 5% constituted the raw transposable element (TE) library.A total of 288.72 Mb repetitive elements were identified, constituting 60.89% of the total genome sequence.The most abundant repeating element was long terminal repeats (LTR, 47.96%), and unknown repeats (43.71%), followed by DNA transposons (6.08%) (Fig. 2b, Table 3, Fig. 3).

Fig. 4
Prediction and annotation of protein-coding genes in the Zicaitai genome.De novo gene prediction, homology-based prediction, and RNA-seq were applied for the annotation of protein-coding genes.The repeat-masked genome was analyzed using Augustus v2.4 38 , GlimmerHMM v3.0.4 39 , Geneid v1.4.5 40 and Genscan 41 for de novo gene prediction.The protein sequences of Cruciferae species were downloaded from the NCBI Database as references for homology-based prediction.Transcriptome assemblies were also generated with Trinity v2.5.1 42 for the genome annotation.A consensus gene set was created by integrating the genes predicted by the aforementioned three methods using EVidenceModeler v1.1.1 43 .Finally, a total of 42,051 protein-coding genes were predicted for the Zicaitai genome by combining the evidence from the transcriptome, ab initio, and homology-based predictions (Fig. 4a).The average length of the predicted protein-coding gene was 2,001 bp.The average lengths of the exon and intron were 228 bp and 235 bp, respectively.Each gene has an average of 4.82 exons.(Table 4).Gene functions were assigned according to the best match by aligning the protein sequences to the Swiss-Prot, GO, NR, InterPro, Pfam, and KEGG databases, respectively.A total of 41,942 (99.74%) genes were functionally annotated (Table 4, Fig. 4b).

Protein-coding genes Number of genes with annotation
For tRNA prediction, the program tRNAscan-SE 44 was used, while for rRNA prediction, Blast 45 program was used with relative species' rRNA sequences as references.Other ncRNAs, including miRNAs and snRNAs, were identified by searching against the Rfam 46 database using the infernal v1.1 47 software with default parameters.Finally, a total of 1,894 non-coding RNAs were predicted, including 4,885 transfer RNAs (tRNAs), 6,402 ribosomal RNAs (rRNAs), 511 micro-RNAs (miRNAs), and 2,774 small nuclear RNAs (snRNAs) (Table 5, Fig. 2b).

Data Records
The Illumina short reads, PacBio long-reads, Hi-C sequencing data, and RNA-seq data used for the genome assembly and annotation have been deposited in the NCBI Sequence Read Archive (SRA) database with the accession number SRP441633 48 .The chromosomal-level genome assembly sequence and annotation information have been deposited in the Figshare database 49 .The chromosomal-level genome assembly sequence was deposited in the GenBank database of NCBI with accession number JAUJLN000000000 50 .

technical Validation
Evaluating the quality of the genome assembly.We evaluated the quality and completeness of the Zicaitai genome assembly using two approaches.First, we mapped short-reads to the genome to verify the accuracy, yielding mapping rates of 99.22%, which suggests the accuracy of the Zicaitai genome assembly.Second, BUSCO analysis found 99.30% of the 1,614 single-copy orthologues in the embryophyta_odb10 database to be complete (84.70% single-copied genes and 14.60% duplicated genes), with 0.2% fragmented and 0.5% missing (Table 4), indicating a remarkably complete assembly of the Zicaitai genome.Additionally, the whole-genome high long terminal repeat (LTR) assembly index (LAI) score is an important indicator of genome assembly quality and completeness.In this study, the LAI score for Zicaitai genome assembly was 10.14, indicating that the assembly quality of Zicaitai reached the reference genome level.
OrthoFinder v2.5.4 51 was used to infer sequence orthology.Phylogenetic trees were constructed using single copy gene family, and the differentiation time was estimated using the r8s program 52 .Based on the time tree, expansions and contractions of the gene family of Zicaitai and seven other Cruciferae species was estimated using CAFE v5 53 with a p value of 0.01.Finally, 236 and 222 gene families experienced expansions and contractions were found in Zicaitai, respectively (Fig. 6).Moreover, we have compared the genome sequences of Zicaitai and Brassica rape (Brara_Chiifu_V4.0), and the results indicated a significant degree of collinearity for the two genome sequences of Zicaitai and Brassica rape, with the exception of certain contigs located on chromosome 9 (Fig. 7).

Fig. 1
Fig. 1 Images of Zicaitai mature plant in laboratory research (a) and field experiment (b).

Fig. 2
Fig. 2 Hi-C chromatin interaction map and circos plot of the genome assembly.(a) Hi-C chromatin interaction map of the Zicaitai assembly.(b) The circos plot of genome characteristic of the Zicaitai.The rings from outside to inside indicate are: (a) GC content, (b) gene distribution, (c) TE density, and (d) ncRNA distribution in ten different chromosomes.

Fig. 3
Fig. 3 Repeat landscape plots illustrating TE accumulation history for Zicaitai genome, based on Kimura distance-based copy divergence analyses, with sequence divergence (CpG adjusted Kimura substitution level) illustrated on the x-axis, percentage of the genome represented by each TE type on the y-axis, and transposon type indicated by the colour chart on the right-hand side.CpG, region of DNA where a cytosine nucleotide is followed by a guanine nucleotide; LINE, long interspersed nuclear element; LTR, long terminal repeat; SINE, short interspersed nuclear element.

Fig. 5
Fig. 5 Genetic components of the Zicaitai genome and seven other Cruciferae species.

Fig. 6
Fig. 6 Phylogenetic tree and gene families expansion and contraction of Zicaitai and seven other Cruciferae species.The scale at the bottom of the figure represents the divergence time.

Fig. 7
Fig. 7 Dot plot represents an alignment of two different genomes of Brassica rapa (x-axis) and Zicaitai (y-axis).Forward matches are shown in red, while reverse matches are shown in blue.

Table 1 .
Statistics of sequencing data generated for the Zicaitai genome assembly.

Table 2 .
Summary statistics of the Zicaitai genome assembly.

Table 3 .
Statistics and classification of repetitive elements in the Zicaitai genome.

Table 4 .
Prediction and annotation of protein-coding genes in the Zicaitai genome.

Table 5 .
Statistics and classification of non-coding RNAs identified in the Zicaitai genome.